Hi all,
This markdown presents a simple simulation to demonstrate the intuition behind how topic models work.
I’m borrowing pretty much directly from the Silge and Robinson (2016) bookdown on tidytext.
Here’s the broad plan: [1] We’ll read in the full text, chapter-wise, of 4 classic literary works from Project Gutenberg.
[2] Each chapter becomes a document.
[3] Each book, which presumably connects its chapters via a common running theme, is our topic. [4] We will try to recover each topic or book just by topic-analyzing the documents and their text.
[5] Since we know a priori what the topics are, we can measure how well our model performed in topic recovery.
So let’s get started.
Recall the gutenbergr package we covered previously.
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.3.4 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.4.3
# Consider 4 classics whose unlabeled chapters have been jumbled up.
titles <- c("Twenty Thousand Leagues under the Sea",
"The War of the Worlds",
"Pride and Prejudice",
"Great Expectations")
# using topic models (LDA) to reorganize the chapters
library(gutenbergr)
# system.time({
books <- gutenberg_works(title %in% titles) %>%
gutenberg_download(meta_fields = "title")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
# }) # 5 secs
books # tibble: 51,663 x 3. cols = {gutenberg_id, text, title}
Next, we split each book into chapters, each chapter into word tokens, and finally cast all that into a document-term matrix (DTM).
As pre-processing, we divide the books into chapters, use tidytext’s unnest_tokens() to separate them into words, then remove stop_words.
We’re treating every chapter as a separate “document”, each with a name like Great Expectations_1 or Pride and Prejudice_11. This will later help assess topic recovery.
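As a quick illustration of the tokenizing step, here’s a toy one-row tibble (document name and text invented) showing what unnest_tokens() does:

```r
library(tibble)
library(tidytext)

# a made-up one-row corpus, purely to illustrate unnest_tokens()
toy <- tibble(document = "Toy_1",
              text = "It was the best of times")

# one output row per lower-cased word; the document column is carried along
toy_words <- unnest_tokens(toy, word, text)
toy_words
# 6 rows: "it", "was", "the", "best", "of", "times"
```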
library(stringr)
# divide each book into chapters; each chapter becomes a separate document
by_chapter <- books %>%
group_by(title) %>%
# detecting chapters using regex
mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0) %>%
unite(document, title, chapter)
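The cumsum(str_detect(...)) trick above deserves a note: each line that matches the regex "^chapter " bumps a running counter, so every subsequent line inherits the number of the chapter it falls in. A minimal sketch on made-up lines:

```r
library(stringr)

# invented lines from a "book", to show the running chapter counter
lines <- c("CHAPTER 1", "It was the best of times,",
           "it was the worst of times.", "CHAPTER 2", "A new day dawned.")

# heading lines flip str_detect() to TRUE; cumsum() turns that into chapter numbers
chapter <- cumsum(str_detect(lines, regex("^chapter ", ignore_case = TRUE)))
chapter
# 1 1 1 2 2
```

Lines before the first heading get chapter 0, which is why the main pipeline filters chapter > 0.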
# split into words
by_chapter_word <- by_chapter %>%
unnest_tokens(word, text)
# find document-word counts
word_counts <- by_chapter_word %>%
anti_join(stop_words) %>%
count(document, word, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
word_counts # A tibble: 104,722 x 3. cols = {document, word, n}
# now cast into dtm
chapters_dtm <- word_counts %>%
cast_dtm(document, word, n)
chapters_dtm # DocumentTermMatrix (documents: 193, terms: 18215)
## <<DocumentTermMatrix (documents: 193, terms: 18215)>>
## Non-/sparse entries: 104722/3410773
## Sparsity : 97%
## Maximal term length: 19
## Weighting : term frequency (tf)
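cast_dtm() is the bridge from tidy one-row-per-count data to the tm-style matrix that topicmodels expects. A small sketch with invented counts:

```r
library(tibble)
library(tidytext)

# invented document-word counts, just to show the cast
toy_counts <- tibble(document = c("d1", "d1", "d2"),
                     word     = c("sea", "nemo", "martians"),
                     n        = c(2L, 1L, 3L))

# pivot the tidy counts into a sparse DocumentTermMatrix
toy_dtm <- cast_dtm(toy_counts, document, word, n)
dim(toy_dtm)  # 2 documents x 3 terms; pairs absent from the tibble become zeros
```

The sparsity figure printed for chapters_dtm above is just the share of those implicit zeros.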
Time now to fit an LDA topic model on the chapters, with K=4 corresponding to the 4 books.
Remember what we are doing and why. The aim is to restore these individual, disorganized chapters to their original books, a challenging problem since the individual chapters are unlabeled. Thus, we don’t know in advance what words might distinguish them into groups.
library(topicmodels)
system.time({
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
}) # ~20 secs
## user system elapsed
## 20.18 0.02 20.19
chapters_lda # A LDA_VEM topic model with 4 topics
## A LDA_VEM topic model with 4 topics.
So far so good. Remember, we just topic-analyzed four full-length novels.
Pretty much like what we’d done previously. Next, also following the earlier workflow, we’ll analyze the LDA output. Behold.
# examine per-topic-per-word probabilities.
chapter_topics <- tidy(chapters_lda, matrix = "beta")
chapter_topics # tibble: 72,860 x 3
# use dplyr's top_n() to find the top 5 terms within each topic.
top_terms <- chapter_topics %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms # # A tibble: 20 x 3
This tidy output lends itself well to a ggplot2 visualization. In particular, the facet_wrap() facility enables a grid of plots to emerge.
# Now visualize via ggplot
library(ggplot2)
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
Well, well. What do we have? These topics seem pretty clearly associated with the four books!
E.g., the topic underlying “captain”, “nautilus”, “sea”, and “nemo” is almost surely Twenty Thousand Leagues Under the Sea, just as “martians”, “black”, and “night” come from The War of the Worlds.
Each document in this analysis represented a single chapter. Thus, we may want to know which topics are associated with each document.
Can we put the chapters back together in the correct books? We can find this by examining the per-document-per-topic probabilities, “gamma”.
Doing the gamma analysis below.
# Per-document classification
chapters_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_gamma # A tibble: 772 x 3
# re-separate the document name into title and chapter
chapters_gamma <- chapters_gamma %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
chapters_gamma # A tibble: 772 x 4
How to interpret the above? Each gamma value is an estimated proportion of words from that document that are generated from that topic.
For example, the model estimates that each word in the Great Expectations_57 document has only a 0.00135% probability of coming from topic 1 (Pride and Prejudice).
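One way to internalize gamma: for any one document, the gamma values across topics sum to 1, since they are shares of that document’s words. A tiny made-up check (document names, terms, and counts all invented):

```r
library(tm)
library(topicmodels)
library(tidytext)

# an invented 2-document count matrix, cast to a DocumentTermMatrix
m <- matrix(c(5, 0, 1,
              0, 4, 2),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("d1", "d2"), c("sea", "martians", "night")))
toy_dtm <- as.DocumentTermMatrix(m, weighting = weightTf)

# fit a tiny 2-topic model and tidy out the per-document gammas
toy_lda   <- LDA(toy_dtm, k = 2, control = list(seed = 1234))
toy_gamma <- tidy(toy_lda, matrix = "gamma")

# each document's gammas sum to 1
tapply(toy_gamma$gamma, toy_gamma$document, sum)
```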
Now that we have these topic probabilities, we can see how well our unsupervised learning did at distinguishing the four books.
We’d expect that chapters within a book would be found to be mostly (or entirely) generated from the corresponding topic.
Having already re-separated the document name into title and chapter, we can now visualize the per-document-per-topic probability for each topic.
# reorder titles in order of topic 1, topic 2, etc before box-plotting
chapters_gamma %>%
mutate(title = reorder(title, gamma * topic)) %>%
ggplot(aes(factor(topic), gamma)) +
geom_boxplot() +
facet_wrap(~ title)
Below, we first find the topic that was most associated with each chapter using top_n(1), which is effectively the “classification” of that chapter.
Again, intuitive workflow. Nothing very complicated.
# finding topic most associated with each chapter using max value, i.e., top_n(1)
chapter_classifications <- chapters_gamma %>%
group_by(title, chapter) %>%
top_n(1, gamma) %>%
ungroup()
chapter_classifications # A tibble: 193 x 4
Now, let us compare what the model estimated against the “known” topic assignments (that is what makes this a simulation: we know the true values in advance, so we can assess the validity of the method being tested).
book_topics <- chapter_classifications %>%
count(title, topic) %>%
group_by(title) %>%
top_n(1, n) %>%
ungroup() %>%
transmute(consensus = title, topic)
book_topics # A tibble: 4 x 2
# find 'em mismatches now.
chapter_classifications %>%
inner_join(book_topics, by = "topic") %>%
filter(title != consensus)
# A tibble: 2 x 5 -- only two chapters mismatch
We see that only two chapters from Great Expectations were misclassified, as LDA described one as coming from the “Pride and Prejudice” topic (topic 1) and one from The War of the Worlds (topic 3). That’s not bad for unsupervised clustering!
We may want to take the original document-word pairs and find which words in each document were assigned to which topic. This is the job of the augment() function. See below.
## == By word assignments: augment() ==
# find which words in each document were assigned to which topic
assignments <- augment(chapters_lda, data = chapters_dtm)
assignments # tibble: 104,721 x 4. cols={doc, term, count, .topic}
# combine assignments table with consensus book titles to find incorrectly classified words.
assignments <- assignments %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
inner_join(book_topics, by = c(".topic" = "topic"))
assignments # tibble: 104,721 x 6. extra cols={title, chapter, consensus}
This combination of the true book (title) and the book assigned to it (consensus) is useful for further exploration.
We can, for example, visualize a confusion matrix, showing how often words from one book were assigned to another, using dplyr’s count() and ggplot2’s geom_tile.
# visualize a confusion matrix for above misclassifications
assignments %>%
count(title, consensus, wt = count) %>%
group_by(title) %>%
mutate(percent = n / sum(n)) %>%
#now ggplot it
ggplot(aes(consensus, title, fill = percent)) +
geom_tile() +
scale_fill_gradient2(high = "red") + # , label = percent_format()
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
panel.grid = element_blank()) +
labs(x = "Book words were assigned to",
y = "Book words came from",
fill = "% of assignments")
What were the most commonly mistaken words?
# What were the most commonly mistaken words?
wrong_words <- assignments %>%
filter(title != consensus)
wrong_words
Well, that’s it for now.
Sudhir